Skip to content

BUG: ensure reindex / getitem to select columns properly copies data for extension dtypes #51197

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Feb 6, 2023

I encountered this while writing more tests for Copy-on-Write. Currently, the general rule is that selecting columns with a list-like indexer using getitem gives a copy:

df = pd.DataFrame(np.random.randn(10, 4), columns=['a', 'b', 'c', 'd'])
subset = df[["a", "b"]]
# subset is a copy
subset.iloc[0, 0] = 0
assert df.iloc[0, 0] != 0

However, that doesn't seem to be the case when the columns we select are extension dtypes. When using dtype="Float64" in the above example, the original df gets updated because subset isn't a copy.

While I am not sure this is an explicitly documented rule (AFAIK this is de-facto behaviour, and only described as such in the discussions related to copy/view and CoW), I do think it would be expected that extension dtypes behave the same as numpy dtypes on this front.
(it also makes writing tests for copy/view behaviour harder if the behaviour doesn't only change for CoW or not, but also depending on numpy vs extension dtypes. This is where I encountered the issue)

@jbrockmendel
Copy link
Member

+1. This seems like the kind of thing that I'd usually want to call a bugfix and you'd want to call a breaking change. Since its for 2.0 i'd be fine documenting it either way.

@jorisvandenbossche jorisvandenbossche added ExtensionArray Extending pandas with custom dtypes or arrays. Copy / view semantics Bug labels Feb 10, 2023
@jorisvandenbossche jorisvandenbossche added this to the 2.0 milestone Feb 10, 2023
@jorisvandenbossche jorisvandenbossche marked this pull request as ready for review February 10, 2023 18:57
@jorisvandenbossche
Copy link
Member Author

Documented as a bug fix.

We could also leave it as is, and delay the behaviour change to pandas 3.0 when Copy-on-Write would be enabled. That has the benefit of not having the performance hit in the meantime of making the copy, but the drawback that we keep the wrong behaviour for a longer time (giving more people the chance to rely on its side-effects). I would personally go for the correct behaviour right now.

@jorisvandenbossche jorisvandenbossche changed the title BUG: ensure getitem to select columns properly copies data for extension dtypes BUG: ensure reindex / getitem to select columns properly copies data for extension dtypes Feb 10, 2023
Copy link
Member

@phofl phofl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah this is a fix, should do for 2.0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Copy / view semantics ExtensionArray Extending pandas with custom dtypes or arrays.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants